- Feed-forward neural networks
- Recurrent neural networks
  - SRN
  - LSTM
  - Bi-LSTM
  - GRU
A subfield of machine learning concerned with learning representations of data. Exceptionally effective at learning patterns.
Deep learning algorithms attempt to learn (multiple levels of) representation by using a hierarchy of multiple layers.
\[h = \sigma(W_1x + b_1)\] \[y = \sigma(W_2h + b_2)\]
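The two-layer feed-forward computation above can be sketched in a few lines of NumPy. The shapes (3 inputs, 4 hidden units, 2 outputs) are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    # elementwise logistic activation
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input vector
W1 = rng.normal(size=(4, 3))    # hidden-layer weights
b1 = np.zeros(4)                # hidden-layer bias
W2 = rng.normal(size=(2, 4))    # output-layer weights
b2 = np.zeros(2)                # output-layer bias

h = sigmoid(W1 @ x + b1)        # h = sigma(W1 x + b1)
y = sigmoid(W2 @ h + b2)        # y = sigma(W2 h + b2)
```

Each layer is just a matrix-vector product plus a bias, passed through a nonlinearity; stacking such layers yields the hierarchy of representations described above.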
- Optimize an objective/cost function \(J(\theta)\)
- Generate an error signal that measures the difference between predictions and target values
- Use the error signal to change the weights and get more accurate predictions
Subtracting a fraction of the gradient moves you towards the (local) minimum of the cost function
To minimize the objective/cost function \(J(\theta)\), update each element of \(\theta\):
\[\theta^{new}_j = \theta^{old}_j - \alpha \frac{\partial}{\partial \theta^{old}_j} J(\theta)\]
Matrix notation for all parameters (\(\alpha\): learning rate):
\[\theta^{new} = \theta^{old} - \alpha \nabla_{\theta}J(\theta)\]
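A minimal sketch of this update rule on a toy one-parameter cost, \(J(\theta) = (\theta - 3)^2\), whose minimum is at \(\theta = 3\) (the cost function and starting point are illustrative choices, not from the notes):

```python
def grad_J(theta):
    # gradient of J(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

alpha = 0.1    # learning rate
theta = 0.0    # arbitrary starting point

# repeatedly subtract a fraction (alpha) of the gradient
for _ in range(100):
    theta = theta - alpha * grad_J(theta)
```

Each step moves \(\theta\) a fraction of the gradient toward the (local) minimum, exactly as the update rule above prescribes.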
Recursively apply the chain rule through each node
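Applied to the two-layer network above with a squared-error cost, the recursive chain rule gives a gradient for every parameter. A minimal NumPy sketch (the target vector `t` and shapes are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=3)                 # input
t = np.array([0.0, 1.0])               # assumed target values
W1 = rng.normal(size=(4, 3)); b1 = np.zeros(4)
W2 = rng.normal(size=(2, 4)); b2 = np.zeros(2)

# forward pass
h = sigmoid(W1 @ x + b1)
y = sigmoid(W2 @ h + b2)
J = 0.5 * np.sum((y - t) ** 2)         # squared-error cost

# backward pass: chain rule applied node by node
dy  = y - t                            # dJ/dy
dz2 = dy * y * (1 - y)                 # through the output sigmoid
dW2 = np.outer(dz2, h); db2 = dz2      # gradients for W2, b2
dh  = W2.T @ dz2                       # error signal into hidden layer
dz1 = dh * h * (1 - h)                 # through the hidden sigmoid
dW1 = np.outer(dz1, x); db1 = dz1      # gradients for W1, b1
```

The error signal `dy` at the output is pushed backward through each node, and every local derivative multiplies onto it; the resulting `dW1`, `dW2` are exactly what the gradient-descent update consumes.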
The learned hypothesis may fit the training data very well, even outliers (noise), but fail to generalize to new examples (test data)
How to avoid overfitting?
Suppose we had the following scenario:
Day 1: Lift Weights
Day 2: Swimming
Day 3: At this point, our model must decide whether we should take a rest day or do yoga. Unfortunately, it only has access to the previous day: it knows we swam yesterday, but it doesn’t know whether we had taken a break the day before. Therefore, it can end up predicting yoga.
\[z_t = \sigma(W_z \cdot [h_{t-1}, x_t])\]
\[r_t = \sigma(W_r \cdot [h_{t-1}, x_t])\]
\[\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])\]
\[h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t\]
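These GRU equations can be sketched directly in NumPy. The hidden size, input size, and sequence length below are arbitrary illustrative choices, and bias terms are omitted as in the equations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x, Wz, Wr, W):
    """One GRU step, following the equations above."""
    hx = np.concatenate([h_prev, x])            # [h_{t-1}, x_t]
    z = sigmoid(Wz @ hx)                        # update gate z_t
    r = sigmoid(Wr @ hx)                        # reset gate r_t
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x]))
    return (1 - z) * h_prev + z * h_tilde       # new hidden state h_t

rng = np.random.default_rng(2)
H, D = 4, 3                                     # hidden size, input size
Wz = rng.normal(size=(H, H + D))
Wr = rng.normal(size=(H, H + D))
W  = rng.normal(size=(H, H + D))

h = np.zeros(H)                                 # initial hidden state
for x in rng.normal(size=(5, D)):               # run over a short sequence
    h = gru_step(h, x, Wz, Wr, W)
```

The update gate `z` interpolates between the old state and the candidate `h_tilde`, which is how the GRU can carry information (such as "we rested two days ago") across many time steps.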